library("tidyverse")
library("readr")
library("broom")
source("99_proj_func.R") # custom functions and theme05_model
Load data
We load the data that was augmented for modelling:
load("../data/04_data_for_modelling.Rdata")Modelling tumor size as a function of variables
Firstly a preliminary analysis of linear model of variables against tumor size as an indicator of disease progression is done.
df_long_nested_model <- df_long_nested |>
mutate(model_object = map(data,
~lm(size_of_primary_tumor ~ value,
data = .x))) |>
mutate(tidy_model = map(model_object, ~tidy(.x,
conf.int = TRUE,
conf.level = 0.95))) |>
unnest(cols = tidy_model) |>
filter(term == "value") |> #discard intercept terms
mutate(p_adjusted = p.adjust(p.value)) |> # correct for multiple testing
arrange(p_adjusted) #sort pvalues to reveal highest significance
df_long_nested_model |>
select(-data, -model_object, -term, -std.error, -statistic) |>
print()# A tibble: 16 × 6
# Groups: variable [16]
variable estimate p.value conf.low conf.high p_adjusted
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 combined_index 2.32 7.08e-18 1.81 2.83 7.08e-18
2 stage 6.71 1.67e- 9 4.57 8.86 1.67e- 9
3 bone_metastases 8.97 2.90e- 9 6.06 11.9 2.90e- 9
4 serum_prostatic_acid_phospha… 0.116 2.63e- 5 0.0623 0.170 2.63e- 5
5 follow_up_months -0.0955 5.96e- 5 -0.142 -0.0492 5.96e- 5
6 is_alive 4.43 2.95e- 4 2.04 6.81 2.95e- 4
7 medical_history -2.66 1.84e- 2 -4.87 -0.450 1.84e- 2
8 serum_hemoglobin -0.663 2.03e- 2 -1.22 -0.103 2.03e- 2
9 performance 2.68 4.24e- 2 0.0922 5.28 4.24e- 2
10 dia_blood_pressure -0.467 2.24e- 1 -1.22 0.286 2.24e- 1
11 weight -0.0472 2.63e- 1 -0.130 0.0355 2.63e- 1
12 sys_blood_pressure 0.222 3.37e- 1 -0.231 0.675 3.37e- 1
13 ekg 0.158 4.20e- 1 -0.226 0.542 4.20e- 1
14 dosage -0.137 6.21e- 1 -0.679 0.406 6.21e- 1
15 treatment -0.161 9.00e- 1 -2.68 2.35 9.00e- 1
16 age 0.00851 9.15e- 1 -0.148 0.165 9.15e- 1
Based on these models it appears that most variables have an independent correlation with tumor size. This will be illustrated below.
Volcano plot
A volcano plot shows our estimate slope vs the p-value.
(df_long_nested_model |>
ggplot(mapping = aes(x = estimate,
y = -log10(p_adjusted),
color = factor(p_adjusted < 0.05))) +
geom_point(alpha = 0.9) +
geom_hline(yintercept = 0)+
geom_vline(xintercept = 0)+
best_theme() +
theme(legend.position = "none") +
ggrepel::geom_text_repel(aes(label = case_when(p_adjusted < 0.05 ~ variable,
TRUE ~ ""))) +
labs(
title = "Estimate vs -log10 of adjusted p-value",
y = "-log10 of p-values adjusted for multiple testing",
x = "estimate")) |>
save_and_show("linear_volcano")Saving 7 x 5 in image
Forest plot
The forest plot shows the confidence interval of the modeled slope for all variables
(forest_plot <- df_long_nested_model |>
filter(p_adjusted < 0.05) |>
ggplot(mapping = aes(x = estimate,
y = fct_reorder(variable, estimate),#sort by the slope
xmin = conf.low,
xmax = conf.high)) +
geom_errorbar(orientation = "y") +
geom_point() +
geom_vline(xintercept = 0) +
theme_minimal(base_size = 12) +
labs(
x = "Estimates (95% CIs)",
y = "Variable",
title = "Variables with significant impact on tumor size"
) +
best_theme()) |>
save_and_show("linear_forest_plot")Saving 7 x 5 in image
Residual QQ-plots
Here we seek to validate if the linear models make sense of if there is a depature from expected normality.
df_resids <- df_long_nested_model |>
mutate(aug = map(model_object, broom::augment)) |>
unnest(aug)(ggplot(df_resids, aes(sample = .resid)) +
stat_qq() +
stat_qq_line() +
facet_wrap(~ variable, scales = "free") +
theme_minimal()) |>
save_and_show("linear_qq")Saving 7 x 5 in image
Evident from the QQ-plots are a distinct departure from the normal-distribution in tails. This is present in each model and therefore it is likely that tumor size has outliers. This is investigated following by a histogram and QQ-plot.
(df_num |>
ggplot() +
geom_histogram(aes(x = size_of_primary_tumor),
color = "black", alpha = 0.8, fill = "grey") +
best_theme()) |>
save_and_show("tumor_histogram")Saving 7 x 5 in image
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
(df_num |>
ggplot(aes(sample = size_of_primary_tumor)) +
stat_qq() +
stat_qq_line(color = "salmon") +
best_theme()) |>
save_and_show("tumor_qq")Saving 7 x 5 in image
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_qq()`).
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_qq_line()`).
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_qq()`).
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_qq_line()`).
These two plots confirm a definite departure from normality, indicative that log-transformed data would make a better fit.
(df_num |>
filter(size_of_primary_tumor > 0) |>
ggplot(aes(sample = log(size_of_primary_tumor))) +
stat_qq() +
stat_qq_line(color = "salmon") +
best_theme()) |>
save_and_show("log_tumor_qq")Saving 7 x 5 in image
Evident from the model is a much closer normal distribution even if an S-shape is present this is likely due to a few outliers, but overall it reveals a logarithmic scale may suit the data better.
Modelling based on log transformed data
As just confirmed it is to be investigated wether a log-transformation of our data improves correlations. This requires dropping patients with tumor size zero.
df_long_log <- df_long_nested |>
unnest(cols = c(data)) |>
filter(size_of_primary_tumor > 0)
df_long_log |> slice_sample(n = 10)# A tibble: 160 × 4
# Groups: variable [16]
variable patient_no size_of_primary_tumor value
<chr> <dbl> <dbl> <dbl>
1 age 232 62 72
2 age 75 47 65
3 age 10 21 78
4 age 189 4 73
5 age 454 17 88
6 age 191 7 72
7 age 505 32 82
8 age 233 25 74
9 age 212 3 76
10 age 214 11 69
# ℹ 150 more rows
Nesting the model objects for cleanliness.
df_long_log_nest <- df_long_log |>
group_by(variable) |>
nest()Model of logarithmic tumor size:
df_long_nested_model_log <- df_long_log_nest |>
mutate(model_object = map(data,
~lm(log(size_of_primary_tumor) ~ value,
data = .x))) |>
mutate(tidy_model = map(model_object, ~tidy(.x,
conf.int = TRUE,
conf.level = 0.95))) |>
unnest(cols = tidy_model) |>
filter(term == "value") |> #discard intercept terms
mutate(p_adjusted = p.adjust(p.value)) |> # correct for multiple testing
arrange(p_adjusted) #sort pvalues to reveal highest significance
df_long_nested_model_log |>
select(-data, -model_object, -term, -std.error, -statistic) |>
print()# A tibble: 16 × 6
# Groups: variable [16]
variable estimate p.value conf.low conf.high p_adjusted
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 combined_index 1.58e-1 6.12e-14 0.118 0.198 6.12e-14
2 stage 4.69e-1 4.28e- 8 0.303 0.634 4.28e- 8
3 bone_metastases 6.16e-1 8.95e- 8 0.393 0.839 8.95e- 8
4 serum_prostatic_acid_phospha… 8.46e-3 5.56e- 5 0.00437 0.0126 5.56e- 5
5 is_alive 2.95e-1 1.69e- 3 0.111 0.478 1.69e- 3
6 follow_up_months -5.63e-3 2.10e- 3 -0.00921 -0.00205 2.10e- 3
7 serum_hemoglobin -5.41e-2 1.34e- 2 -0.0969 -0.0113 1.34e- 2
8 medical_history -2.11e-1 1.44e- 2 -0.380 -0.0423 1.44e- 2
9 sys_blood_pressure 1.93e-2 2.73e- 1 -0.0153 0.0540 2.73e- 1
10 dosage -2.27e-2 2.85e- 1 -0.0643 0.0190 2.85e- 1
11 dia_blood_pressure -2.73e-2 3.50e- 1 -0.0848 0.0301 3.50e- 1
12 ekg 1.15e-2 4.45e- 1 -0.0180 0.0409 4.45e- 1
13 performance 7.31e-2 4.68e- 1 -0.125 0.271 4.68e- 1
14 treatment -3.26e-2 7.40e- 1 -0.225 0.160 7.40e- 1
15 age -1.40e-3 8.18e- 1 -0.0133 0.0105 8.18e- 1
16 weight 1.01e-4 9.75e- 1 -0.00624 0.00644 9.75e- 1
If we compare these p-values with earlier linear model we see a much more significant correlation between these and log-transformed tumor size. This will be validated with residual QQ-plots.
Preparing for residual plots
df_resids_log <- df_long_nested_model_log |>
mutate(aug = map(model_object,
broom::augment)) |>
unnest(aug)(ggplot(df_resids_log, aes(sample = .resid)) +
stat_qq() +
stat_qq_line() +
facet_wrap(~ variable, scales = "free") +
theme_minimal()) |>
save_and_show("log_qq")Saving 7 x 5 in image
Aligning with our log-transformed tumor data the variables now also show a smaller from expected expectation normality.
Volcano plot
A volcano plot shows our estimate slope vs the p-value.
(
df_long_nested_model_log |>
ggplot(mapping = aes(x = estimate,
y = -log10(p_adjusted),
color = factor(p_adjusted < 0.05))) +
geom_point(alpha = 0.9) +
geom_hline(yintercept = 0)+
geom_vline(xintercept = 0)+
best_theme() +
theme(legend.position = "none") +
ggrepel::geom_text_repel(aes(label = case_when(p_adjusted < 0.05 ~ variable,
TRUE ~ ""))) +
labs(
title = "Estimate vs -log10 of adjusted p-value",
y = "-log10 of p-values adjusted for multiple testing",
x = "estimate")) |>
save_and_show(plot_name = "log_volcano")Saving 7 x 5 in image
Forest plot
The forest plot shows the confidence interval of the modelled slope for all variables
(df_long_nested_model_log |>
filter(p_adjusted < 0.05) |>
ggplot(mapping = aes(x = estimate,
y = fct_reorder(variable, estimate),#sort byr the skope
xmin = conf.low,
xmax = conf.high)) +
geom_errorbar(orientation = "y") +
geom_point() +
geom_vline(xintercept = 0) +
theme_minimal(base_size = 12) +
labs(
x = "Estimates (95% CIs)",
y = "Variable",
title = "Variables with significant impact on tumor size"
) +
best_theme()) |>
save_and_show("log_forest")Saving 7 x 5 in image
Correlation among continuous clinical variables
Beyond comparison with tumor size a pairwise correlation for the quantitative biological and physiological measurements is computed to identify potential collinearity before survival analysis.
# Select relevant numeric variables
corr_df <- df_num |>
select(
age,
weight,
serum_hemoglobin,
serum_prostatic_acid_phosphatase,
size_of_primary_tumor)
# Compute raw correlation matrix
corr_mat <- cor(corr_df, use = "pairwise.complete.obs")Correlation heatmap
# Convert matrix to long format for ggplot2
corr_long <- as.data.frame(corr_mat) |>
rownames_to_column("Var1") |>
pivot_longer(
cols = -Var1,
names_to = "Var2",
values_to = "cor"
)
corr_long |>
ggplot(aes(x = Var2, y = Var1, fill = cor)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.2f", cor)), size = 3) +
scale_fill_gradient2(
low = "blue",
mid = "white",
high = "red",
midpoint = 0,
limits = c(-1, 1),
name = "Correlation"
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1), # retate words
panel.grid = element_blank() # remove grid lines
) +
labs(
title = "Correlation Matrix of Selected Variables",
x = NULL,
y = NULL
)This correlation matrix helps to identify redundant predictors. All the relevances is below 0.8, which indicates the selected variables can be seen as independent and used in a survival analysis. Given previous analysis it can also be concluded that shown variables
Exploratory multivariate analysis
Expanding on colliniarity and variables that have been shown significant in relation to log(size_of_primary_tumor) a few multivariate models are explored.
model_tidy <- lm(log(size_of_primary_tumor) ~
serum_hemoglobin +
medical_history +
bone_metastases +
serum_prostatic_acid_phosphatase,
data = df_num |>
filter(size_of_primary_tumor > 0)) |>
tidy()
print(model_tidy)# A tibble: 5 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.57 0.310 8.30 1.09e-15
2 serum_hemoglobin -0.0206 0.0221 -0.930 3.53e- 1
3 medical_history -0.187 0.0837 -2.24 2.57e- 2
4 bone_metastases 0.463 0.126 3.68 2.62e- 4
5 serum_prostatic_acid_phosphatase 0.00473 0.00225 2.10 3.61e- 2
Logistic regression predicting survivability (bad results)
A binary logistic regression is perfomed on the is_alive column. This is of limited biological relevanc but may reveal an interesting insight. interesting cols: select(is_alive, age, weight, size_of_primary_tumor,serum_prostatic_acid_phosphatase, performance)
# Turn df is_alive column into numeric variable for modelling:
df_a_ts <- df_num |>
select(is_alive, size_of_primary_tumor) |>
drop_na()
df_a_ts |>
ggplot(aes(y = is_alive,
x = size_of_primary_tumor)) +
geom_point() +
labs(
x = "Size of primary tumor",
y = "Alive at time of study (1: YES, 0: NO)",
title = "Being alive vs tumor size"
) +
best_theme()There seems to be a threshold for tumor size vs being alive at the time of study.
Fitting the logistic regression
my_glm_mdl <- glm(formula = is_alive ~ size_of_primary_tumor,
family = binomial(link = "logit"),
data = df_num)
my_glm_mdl
Call: glm(formula = is_alive ~ size_of_primary_tumor, family = binomial(link = "logit"),
data = df_num)
Coefficients:
(Intercept) size_of_primary_tumor
0.4333 0.0339
Degrees of Freedom: 490 Total (i.e. Null); 489 Residual
(5 observations deleted due to missingness)
Null Deviance: 592.4
Residual Deviance: 578.1 AIC: 582.1
df_a_ts |>
mutate(my_glm_model = pluck(my_glm_mdl, "fitted.values")) |>
ggplot(aes(x = size_of_primary_tumor, y = is_alive)) +
geom_point(alpha = 0.5,
size = 3) +
geom_line(aes(y = my_glm_model)) +
labs(
x = "Size of primary tumor",
y = "Alive at time of study (0: YES, 1: NO)",
title = "Being alive vs tumor size"
) +
best_theme()As mentioned there seems to be a certain correlation between these two but it is not a very telling analysis so it should not be considered.
Multivariate on measurements – No significant conclusions
Mentioned preivously this is an exploration of blood and treatment data to evaluate whether it could have an explanatory effect. This is conducted due to QQ-plots showing some variation and S-tailing. First explored is the tumor size, and potential effects of treatment, AP and hemoglobin.
ap_ts_log_tidy_model <- lm(
log(size_of_primary_tumor) ~
log(serum_prostatic_acid_phosphatase) +
serum_prostatic_acid_phosphatase +
serum_hemoglobin +
dosage +
bone_metastases +
stage,
data = df_num |>
filter(size_of_primary_tumor > 0))
print(ap_ts_log_tidy_model |>
tidy())# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.06 0.555 5.50 6.05e-8
2 log(serum_prostatic_acid_phosphatase) 0.242 0.0543 4.46 1.01e-5
3 serum_prostatic_acid_phosphatase -0.00546 0.00298 -1.83 6.76e-2
4 serum_hemoglobin -0.0268 0.0217 -1.24 2.17e-1
5 dosage -0.0297 0.0200 -1.48 1.39e-1
6 bone_metastases 0.261 0.136 1.92 5.51e-2
7 stage -0.118 0.132 -0.892 3.73e-1
Disregarding non-significant values:
ap_ts_log_tidy_model |>
tidy() |>
mutate(padj = p.adjust(p.value, method = "fdr")) |>
filter(padj < 0.05) |>
save_and_show_table("log_tumor_sig_fit")# A tibble: 2 × 6
term estimate std.error statistic p.value padj
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.06 0.555 5.50 6.05e-8 4.23e-7
2 log(serum_prostatic_acid_phospha… 0.242 0.0543 4.46 1.01e-5 3.55e-5
Seems that log(AP) does have a significant correlation with tumor size but not more than the log tumor size.
Exploratory analysis of effects on prostatic acid phosphatase
ap_es_log_tidy_model <- lm(
serum_prostatic_acid_phosphatase ~
dosage +
I(dosage^2) +
I(dosage^2)*serum_hemoglobin +
bone_metastases*dosage +
log(size_of_primary_tumor),
data = df_num |> filter(size_of_primary_tumor > 0))
print(ap_es_log_tidy_model |> tidy())# A tibble: 8 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 9.16 7.55 1.21 0.226
2 dosage 0.259 2.77 0.0933 0.926
3 I(dosage^2) 1.30 0.802 1.62 0.106
4 serum_hemoglobin -0.794 0.516 -1.54 0.124
5 bone_metastases 19.6 3.18 6.16 0.00000000156
6 log(size_of_primary_tumor) 2.00 0.918 2.17 0.0302
7 I(dosage^2):serum_hemoglobin -0.0951 0.0430 -2.21 0.0274
8 dosage:bone_metastases -0.341 1.15 -0.298 0.766
Again filtering for non-significant contributors.
ap_es_log_tidy_model |>
tidy() |>
mutate(padj = p.adjust(p.value, method = "fdr")) |>
filter(padj < 0.05) |>
print()# A tibble: 1 × 6
term estimate std.error statistic p.value padj
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bone_metastases 19.6 3.18 6.16 0.00000000156 0.0000000125
We see the primary contributors to the AP levels to be bone metastasis, a relevant conclusion as this is prevalent in advanced diagnosis.
Exploratory analysis of effects on bone metastasis
We wanted to investigate the correlation between bone metastatis and blood factors.
bm_ap_hb_model <- lm(
bone_metastases ~
serum_prostatic_acid_phosphatase*I(dosage**2) +
serum_prostatic_acid_phosphatase*serum_hemoglobin +
dosage*serum_hemoglobin,
data = df_num)
print(bm_ap_hb_model |> tidy())# A tibble: 8 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.78e-1 0.129 5.25 2.23e-7
2 serum_prostatic_acid_phosphatase -1.22e-2 0.00363 -3.35 8.75e-4
3 I(dosage^2) -1.87e-2 0.00902 -2.08 3.83e-2
4 serum_hemoglobin -4.68e-2 0.00950 -4.92 1.18e-6
5 dosage 9.47e-2 0.0695 1.36 1.74e-1
6 serum_prostatic_acid_phosphatase:I(dosag… -2.66e-5 0.0000654 -0.407 6.84e-1
7 serum_prostatic_acid_phosphatase:serum_h… 1.81e-3 0.000336 5.38 1.14e-7
8 serum_hemoglobin:dosage 9.84e-4 0.00391 0.252 8.01e-1
bm_ap_hb_model |>
tidy() |>
mutate(padj = p.adjust(p.value, method = "fdr")) |>
filter(padj < 0.05) |>
save_and_show_table("bone_sig_fit")# A tibble: 4 × 6
term estimate std.error statistic p.value padj
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.678 0.129 5.25 2.23e-7 8.90e-7
2 serum_prostatic_acid_phosphatase -0.0122 0.00363 -3.35 8.75e-4 1.75e-3
3 serum_hemoglobin -0.0468 0.00950 -4.92 1.18e-6 3.14e-6
4 serum_prostatic_acid_phosphatase… 0.00181 0.000336 5.38 1.14e-7 8.90e-7
Here we can confirm that bone metastasis is indeed influenced by both AP and hemoglobin levels but not estrogen treatment dosage.
AP levels by bone metastasis status
As just shown serum AP is commonly elevated in advanced prostate cancer and is often used as a surrogate marker for metastatic activity. We compare AP levels between patients with and without bone metastasis and draw a boxplot to show the result.
df_num |>
ggplot(aes(
x = factor(bone_metastases),
y = serum_prostatic_acid_phosphatase)) +
geom_boxplot() +
scale_y_log10() + # for log scale
labs(
x = "Bone metastases",
y = "AP (log scale)",
title = "AP levels (log scale) by bone metastases status"
)+
theme(plot.title = element_text(hjust = 0.5) # keep titles in the middle
)We can see that AP levels are expected to be substantially higher in patients with bone metastases, reinforcing AP as a marker of advanced disease. This observation is consistent with the extreme AP values identified earlier in 04_describe.
Interpretation
In summary, several variables have been confirmed to be significantly associated with tumor size, and therefore with disease severity. In addition, correlations between key variables such as metastasis and AP levels have been confirmed. Along with this confirmation correlations for use in subsequent survival analysis have been confirmed. Blood-derived factors showed the expected association with the presence of bone metastases, but they did not show significant correlation with each other, nor with treatment.